---
title: "Replication of Table 1 in Muris (2019, Section 6, Empirical application)"
output: html_notebook
---

## Setup

See Section 6 of the paper for more details about this exercise. The goal is to

1. replicate results in Topalova and Khandelwal (TK), Table 4, column 3
2. visualize the incomplete data patterns
3. implement the FD versions of 
      - TK Table 4, Column 3 (Available case)
      - TK Table 4, Column 4 (Complete case)
      - Efficient estimator developed in the paper
   using GMM.


Start by loading the packages required for this analysis. If any of these packages is not installed on your machine, use `install.packages` to install them. For example, if you get an error message that `Error in library(gmm) : there is no package called 'gmm'`, then tell `R` to `install.packages("gmm")`.

```{r}
library(tidyverse)
library(haven)
library(plm)
library(gmm)
library(mgcv)
library(Matrix)
library(cowplot)
```

## Load the TK data

Now load the data that was created by running `01-load-data.do` in the same folder as the files in the archive of Topalova and Khandelwal (2011), which can be obtain from  [REStat's Dataverse page](https://dataverse.harvard.edu/dataset.xhtml?persistentId=hdl:1902.1/18094).

```{r}
(india <- read_dta("trade_liberalization.dta"))
```

## Replicate TK Table 4 Column 3 (Available case)

We can use `plm` to run fixed effects and first-differenced regressions as in TK's Table 4, Column 3. Note that an indicator for the company is `companyname`. An indicator for the year is `yr`. The following chunk creates the panel data frame with those indicators and issues the `R`+`plm` equivalent of Stata's `lm(formula = tfp1000 ~ lagtariff + age + age2 + as.factor(yr) + as.factor(companyname), data = india)`.

```{r}
p_india <- pdata.frame(india, index=c("companyname","yr"), row.names=TRUE)
col3 <- plm(tfp1000 ~ lagtariff + age + age2 + factor(yr),
            data = p_india, 
            model = "within")
summary(col3)
```

The results are as in TK (up to the standard errors: `plm` does not cluster by default). As explained in the main text, I prefer a first-differenced estimator. The following chunk implements it, by changing the `model` option.

```{r}
col3_fd <- plm(tfp1000 ~ lagtariff + age + age2 + factor(yr),
            data = p_india, 
            model = "fd")
summary(col3_fd)
```

The FD results are close to the FE results (`-0.53` v `-0.46`)

```{r}
cbind(col3$coefficients,col3_fd$coefficients[-1])
```

Finally, I prefer to leave out `age`, as I find it hard to interpret its effect in the presence of firm + time effects.

```{r}
col3_fd_noage <- plm(tfp1000 ~ lagtariff + factor(yr),
            data = p_india, 
            model = "fd")
summary(col3_fd_noage)
```

A comparison of the three results so far shows that omitting `age` does not impact the estimated effect of tariffs.

```{r}
print(round(cbind(col3$coefficients,col3_fd$coefficients[-1],c(col3_fd_noage$coefficients[2],NA,NA,col3_fd_noage$coefficients[-c(1,2)])),3),
      nsmall = 3)
```

## Data availability plots

This section visualizes the patterns of incompleteness in the data from TK. First, a histogram with the number of available time periods per firm.

```{r}
patternplot1 <- india %>% group_by(companyname) %>% summarise( how_many = n()) %>% 
  ggplot() + geom_histogram(aes(x = how_many), binwidth = 1)
save_plot("pattern_1.pdf",patternplot1)

```

Also useful (not included in the paper): a plot with the number of companies in each time period.

```{r}
india %>% group_by(yr) %>% summarise( how_many = n()) %>% 
  ggplot(aes(x=yr, y = how_many)) + geom_line() + geom_point(size = 5)
```

An alternative way to see the unbalancedness of the data is to make the following plot, where each row corresponds to one of the first 20 firms (alphabetical order), and each column corresponds to a year. A dot in a position means that at least some measurements are available for that firm in that year.

A different plot we could make is the pure availability for the top 20 firms (keep in mind: some observations are still missing in this one!)

```{r}
patternplot2 <- india[1:180,] %>% ggplot() + geom_point(aes(x = yr, y = companyname))
save_plot("pattern_2.pdf",patternplot2, base_aspect_ratio = 2,base_height = 10)
```

## Optimal FD estimation via GMM

Columns 2, 4, and 5 in Table 1 of Muris (2019) are all based on the moment conditions in equation (6.1) of the paper. To obtain these results, we set up the moment conditions for the first differenced estimator. To that end, start by turning the panel data set into a cross-sectional one

```{r}
(cs_india <- india %>% select(
  companyname, yr, tfp1000, tariff, lagtariff # keep only a few variables
) %>% na.omit() %>% 
  gather(`tfp1000`, `tariff`, `lagtariff`, key = variable, value = value) %>% 
  unite(varname,variable,yr) %>%                 # create variables with names that reflect the year it was measured
  spread(varname,value))                         # and spread to a cross-section format
```

Next, generate differenced regressors. Abbreviating the RHS and LHS to `X` and `Y` to make the code below a bit easier to recycle.

```{r}
(cs_D_india <- cs_india %>% transmute(
  
  Y94 = tfp1000_1994,
  Y93 = tfp1000_1993,
  Y92 = tfp1000_1992,
  Y91 = tfp1000_1991,
  Y90 = tfp1000_1990,
  Y89 = tfp1000_1989,
  
  dY96 = tfp1000_1996 - tfp1000_1995,
  dY95 = tfp1000_1995 - tfp1000_1994,
  dY94 = tfp1000_1994 - tfp1000_1993,
  dY93 = tfp1000_1993 - tfp1000_1992,
  dY92 = tfp1000_1992 - tfp1000_1991,
  dY91 = tfp1000_1991 - tfp1000_1990,
  dY90 = tfp1000_1990 - tfp1000_1989,
  
  dX96 = lagtariff_1996 - lagtariff_1995,
  dX95 = lagtariff_1995 - lagtariff_1994,
  dX94 = lagtariff_1994 - lagtariff_1993,
  dX93 = lagtariff_1993 - lagtariff_1992,
  dX92 = lagtariff_1992 - lagtariff_1991,
  dX91 = lagtariff_1991 - lagtariff_1990,
  dX90 = lagtariff_1990 - lagtariff_1989
))
```

Here, the `gmm` package is going to do the heavy lifting in terms of computation, standard errors, etc. It expects the data in matrix format.

```{r}
Z <- as.matrix(cs_D_india)
Z[1:10,c(1:3,10:13)]
```

The following function implements the moment functions corresponding to the first differenced estimator. The `case` switch allows us to indicate whether we want to go for available case or complete case.

```{r}
topkha_fd <- function(theta,data,case){
  
  # Regression coefficient.
  beta = theta[1]
  
  # Difference in time dummies: will be constant per thing.
  dl90 = theta[2]
  dl91 = theta[3]
  dl92 = theta[4]
  dl93 = theta[5]
  dl94 = theta[6]
  dl95 = theta[7]
  dl96 = theta[8]
  
  dU96 = data[,"dY96"] - beta*data[,"dX96"] - dl96
  dU95 = data[,"dY95"] - beta*data[,"dX95"] - dl95
  dU94 = data[,"dY94"] - beta*data[,"dX94"] - dl94
  dU93 = data[,"dY93"] - beta*data[,"dX93"] - dl93
  dU92 = data[,"dY92"] - beta*data[,"dX92"] - dl92
  dU91 = data[,"dY91"] - beta*data[,"dX91"] - dl91
  dU90 = data[,"dY90"] - beta*data[,"dX90"] - dl90


  moments <- cbind(
    data[,"dX96"]*dU96,
    data[,"dX95"]*dU95,
    data[,"dX94"]*dU94,
    data[,"dX93"]*dU93,
    data[,"dX92"]*dU92,
    data[,"dX91"]*dU91,
    data[,"dX90"]*dU90,
    dU96,
    dU95,
    dU94,
    dU93,
    dU92,
    dU91,
    dU90
  )
  
  # Available case fix:
  if(case == "available"){
    moments[is.na(moments)] = 0
  }
  if(case == "complete"){
    moments <- moments[complete.cases(moments),]
  }
  
  # Return
  return(moments)
  
}

topkha_fd_cc <- function(theta,data){topkha_fd(theta,data,"complete")}
topkha_fd_ac <- function(theta,data){topkha_fd(theta,data,"available")}
```

The estimates in columns 2 and 4 of Table 1 can then be obtained via the `gmm` package.

```{r}
# topkha(Z,numeric(8))
out_cc <- gmm(g = topkha_fd_cc, x = Z, t0 = numeric(8), 
                        type = "twoStep", vcov = "iid",
                        optfct = "nlminb")
out_ac <- gmm(g = topkha_fd_ac, x = Z, t0 = numeric(8), 
                        type = "twoStep", vcov = "iid",
                        optfct = "nlminb")

summary(out_cc)
summary(out_ac)
```

## Intermezzo: identifying patterns

To program the moment function that leads to the efficient estimator, we need to identify the incomplete data patterns in this data set. For a given set of moments, the following tells us what the patterns are.

```{r}
missings <- topkha_fd(numeric(8)+0.024879,Z,"test_case")  # both arguments are arbitrary
missings[!is.na(missings)] = 1
missings[is.na(missings)] = 0
# Now extract the unique rows.
(patterns <- uniquecombs(missings,ordered=FALSE))
```

For each row, find out the pattern.

```{r}
which_pat <- function(x,patterns){
  index <- NA
  J <- nrow(patterns)
  for(j in 1:J){
    if(min(x == patterns[j,])){
      index <- j
    }
  }
  return(index)
}
strata <- apply(missings,1,which_pat,patterns = patterns)
as_tibble(strata) %>% group_by(value) %>% summarize(count = n()) %>% arrange(desc(count))
```

The table above shows that most patterns do not occur frequently. Leave out the patterns that occur 20 times or less.

```{r}
step1 <- as_tibble(strata) %>% group_by(value) %>% summarize(count = n()) %>% arrange(desc(count)) %>% filter(count>20)
use_these_patterns <- step1$value
patterns[use_these_patterns,]
```

## Efficient moment functions

Now, we can construct moment conditions for the efficient estimator by stacking the moment conditions for the available case estimator... The following chunk does that. First, it obtains the observable components of the moment functions for each stratum (via `topkha_fd_stratum`). Then, it stacks them across all patterns using `topkha_fd_efficient`.

```{r}
# Remember: `Z` has the data.
#           `strata` has the corresponding stratum indicators
#           `patterns` has the indicator for the moment functions observed in that stratum.
topkha_fd_stratum <- function(stratum,theta,data){
  bare <- topkha_fd(theta,data[strata == stratum,],"case")
  return(
    bare[,patterns[stratum,] == 1]
  )
}
topkha_fd_efficient <- function(theta,data){
  all_moments <- bdiag(use_these_patterns %>% map(topkha_fd_stratum,theta = theta, data = data))
  return(as.matrix(all_moments))
}
```

Feed this into `gmm`.

```{r}
out_eff <- gmm(g = topkha_fd_efficient, x = Z, t0 = numeric(8), 
                        type = "twoStep", vcov = "iid",
                        optfct = "nlminb")
summary(out_eff)
```

## Inference

The paper presents standard errors via the bootstrap. I omit the code here because it is computationally expensive. Implementation is done via sampling with replacement from the original data set and applying the procedure above ("non-parametric bootstrap").

